Platelet RNA enables accurate detection of ovarian cancer: an intercontinental, biomarker identification study

Abstract Platelets are reprogrammed by cancer via a process called education, which favors cancer development. The transcriptional profile of tumor-educated platelets (TEPs) is skewed and therefore practicable for cancer detection. This intercontinental, hospital-based, diagnostic study included 761 treatment-naïve inpatients with histologically confirmed adnexal masses and 167 healthy controls from nine medical centers (China, n = 3; Netherlands, n = 5; Poland, n = 1) between September 2016 and May 2019. The main outcomes were the performance of TEPs and their combination with CA125 in two Chinese (VC1 and VC2) and the European (VC3) validation cohorts collectively and independently. The exploratory outcome was the value of TEPs in public pan-cancer platelet transcriptome datasets. The AUCs for TEPs in the combined validation cohort, VC1, VC2, and VC3 were 0.918 (95% CI 0.889–0.948), 0.923 (0.855–0.990), 0.918 (0.872–0.963), and 0.887 (0.813–0.960), respectively. The combination of TEPs and CA125 demonstrated an AUC of 0.922 (0.889–0.955) in the combined validation cohort; 0.955 (0.912–0.997) in VC1; 0.939 (0.901–0.977) in VC2; 0.917 (0.824–1.000) in VC3. In subgroup analyses, TEPs exhibited AUCs of 0.858, 0.859, and 0.920 for detecting early-stage, borderline, and non-epithelial disease, respectively, and 0.899 for discriminating ovarian cancer from endometriosis. TEPs showed robustness, compatibility, and universality for the preoperative diagnosis of ovarian cancer, since they withstood validation in populations of different ethnicities, heterogeneous histological subtypes, and early-stage ovarian cancer. However, these observations warrant prospective validation in a larger population before clinical use.

supernatant was transferred into a new Eppendorf tube with 0.3 mL chloroform/isoamyl alcohol (24:1). The mix was shaken vigorously for 15 s and then centrifuged at 12 000 × g for 10 min at 4 °C. The upper aqueous phase containing RNA was transferred into a new tube with an equal volume of isopropyl alcohol and centrifuged at 12 000 × g for 20 min at 4 °C. After discarding the supernatant, the RNA pellet was washed twice with 1 mL 75% ethanol, and the mix was centrifuged at 12 000 × g for 3 min at 4 °C to collect residual ethanol, followed by air-drying of the pellet for 5–10 min in the biosafety cabinet. Finally, 25–100 μL of DEPC-treated water was added to dissolve the RNA pellet. Subsequently, total RNA was qualified and quantified using a NanoDrop spectrophotometer (Thermo Fisher Scientific, MA, USA) and an Agilent 2100 Bioanalyzer.
For samples with total RNA < 50 ng, total RNA was extracted from platelets using the RNeasy Micro Kit (QIAGEN, 74004) in accordance with the manufacturer's instructions.
An appropriate amount of platelets was ground to powder in liquid nitrogen and then transferred into a new tube with an appropriate volume of Buffer RL and 1 volume of 70% ethanol. The mixture was transferred into an RNeasy MinElute spin column and centrifuged at ≥ 8000 × g for 15 s.
After discarding the flow-through, Buffer RW1, DNase I, Buffer RPE, and 80% ethanol were added sequentially, with centrifugation after each addition. The RNeasy MinElute spin column containing RNA was placed in a new 2-mL collection tube and centrifuged with the lid open at 12 000 × g for 5 min to dry the membrane, and then transferred to a new 1.5-mL tube with 14 μL RNase-free water. Finally, the tubes were centrifuged for 1 min at 12 000 × g to elute the RNA. Total RNA was qualified and quantified using a NanoDrop spectrophotometer (Thermo Fisher Scientific, MA, USA) and an Agilent 2100 Bioanalyzer.
For samples in the discovery cohort, DNase I was used to digest double- and single-stranded DNA in total RNA. Thereafter, the reaction products were purified and recovered using magnetic beads.
The RNase H or Ribo-Zero method (human, mouse, plants) (Illumina, San Diego, CA, USA) was used to eliminate rRNA. Purified mRNA was fragmented into small pieces using fragment buffer. Thereafter, the first-strand cDNA was generated in the First Strand Reaction System via PCR, and the second strand of cDNA was also generated. The reaction product was purified using magnetic beads. A-Tailing Mix and RNA Index Adapters were added for end repair. The cDNA fragments with adapters were amplified via PCR and the products were purified using Ampure XP Beads. The quality and quantity of the library were assessed via two methods to ensure the high quality of the sequencing data: one method involved assessing the distribution of the fragment sizes using the Agilent 2100 Bioanalyzer; the other method involved quantifying the library via real-time quantitative PCR. The qualified library was amplified on cBot to generate the cluster on the flowcell, and the amplified flowcell was sequenced single-end on the HiSeq4000 platform.
For samples with total RNA > 50 ng, except those in the discovery cohort, oligo(dT)-attached magnetic beads were used to purify mRNA. Purified mRNA was fragmented with fragment buffer at 94 °C for 5 min. Thereafter, the first strand of cDNA was generated using the First Strand reaction system via PCR and then the second strand of cDNA was generated. The reaction product was purified using Ampure XP Beads and dissolved in EB solution. The quality and quantity of the library were assessed via two methods to ensure the high quality of the sequencing data: one method involved assessing the distribution of the fragment sizes using the Agilent 2100 Bioanalyzer; the other method involved quantifying the library via real-time quantitative PCR. The qualified library was amplified on cBot to generate the cluster on the flowcell. The amplified flowcell was then sequenced single-end on the HiSeq4000 or HiSeq X-ten platform (BGI-Shenzhen, China).
For samples with total RNA between 10 pg and 50 ng, the platelet RNA was mixed with oligo-dT and dNTPs, incubated at 72 °C, and immediately placed on ice, followed by reverse transcription to form cDNA, based on the polyA tail method. The template was switched to the 5′ end of the RNA, and full-length cDNA was generated via PCR. The Agilent 2100 Bioanalyzer instrument (Agilent High Sensitivity DNA Reagents) was used to determine the average molecule length of the PCR product. The cDNA library was quantified using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Waltham, MA, USA) for accurate quantification, followed by fragmentation with fragment buffer. Thereafter, the A-Tailing Mix and RNA Index Adapters were added for end repair. The cDNA fragments with adapters were amplified via PCR. The PCR products were purified using Ampure XP Beads and then size-selected. The final library was quantified using two methods to ensure the high quality of the sequencing data: one method involved determining the average molecule length by using the Agilent 2100 Bioanalyzer instrument (Agilent DNA 12000 Reagents); the other method involved quantifying the library via real-time quantitative PCR (qPCR). The qualified libraries were amplified using cBot to generate the cluster on the flowcell. The amplified flowcell was sequenced single-end on the HiSeq4000 platform (BGI-Shenzhen, China).

Data normalization and batch effect removal
In the normalization process, raw read counts of the training cohort were normalized by "Variance Stabilizing Transformation" with the parameter "blind=FALSE", with dispersions estimated by the "Dispersion Function", using the R package DESeq2 [3]. For the validation cohorts, we assigned the dispersion values estimated from the training cohort as their dispersion and normalized them with the same method. To exclude samples with low inter-sample correlation, we computed Pearson correlations with the "bigcor" function of the R package propagate; this identified one sample with a correlation of < 0.4, which was excluded from the training cohort.
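The inter-sample correlation filter described above can be illustrated with a minimal Python sketch. It is not the authors' code (the study used the R package propagate); the function names and toy data are hypothetical, and summarizing each sample by its mean Pearson correlation with all other samples is an assumption, since the text does not state how the per-sample value was derived.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def low_correlation_samples(samples, threshold=0.4):
    """Return names of samples whose mean correlation with the others is below threshold.

    `samples` maps a sample name to its normalized expression vector
    (same gene order in every sample). Averaging over the other samples
    is an assumption of this sketch.
    """
    flagged = []
    for name, values in samples.items():
        others = [pearson(values, v) for other, v in samples.items() if other != name]
        if sum(others) / len(others) < threshold:
            flagged.append(name)
    return flagged

# Toy data (hypothetical): three concordant samples and one outlier.
toy = {
    "s1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "s2": [1.1, 2.1, 2.9, 4.2, 5.1],
    "s3": [0.9, 1.8, 3.2, 3.9, 4.8],
    "s4": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(low_correlation_samples(toy))
```

In this toy example only the discordant sample `s4` falls below the 0.4 threshold.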
To minimize the influence of age (Supplementary Figure S6A), library size (Supplementary Figure S6B), and known batches on further classification, we investigated these potential confounding factors with surrogate variables identified via svaseq in the R package sva with default parameters [4]. Each estimated surrogate variable was tested for association with the potential confounding factors in the cancer or non-cancer group. Continuous variables were correlated with surrogate variables by Pearson correlation, and categorical variables were compared using a two-sided Student's t-test. To avoid removing a surrogate variable that was likely correlated with the cancer or non-cancer group, surrogate variables with a correlation P-value < 0.05 were not adjusted for. The identified confounding factors were used to adjust the normalized data with removeBatchEffect from the R package limma [5]. The P-values between confounding factors and surrogate variables are illustrated in Supplementary Figure S6C. We compared the performance before and after eliminating confounding factors and plotted the relative log expression (RLE) using the plotRLE function in the R package EDASeq (Supplementary Figure S6D).
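The idea behind adjusting normalized data for a confounding factor can be sketched as a per-gene linear regression. The following Python sketch is a simplified stand-in rather than the limma implementation: it handles one gene and one continuous covariate, ignores the cancer/non-cancer design term that a full adjustment would protect, and uses hypothetical names and toy values.

```python
def remove_covariate(expression, covariate):
    """Remove the linear effect of one continuous covariate from one gene.

    Fits expression = a + b * covariate by ordinary least squares and
    subtracts b * (covariate - mean(covariate)), so the gene's mean
    expression is preserved.
    """
    n = len(expression)
    mean_x = sum(covariate) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, expression))
    slope = sxy / sxx
    return [y - slope * (x - mean_x) for x, y in zip(covariate, expression)]

# Hypothetical example: one gene whose expression drifts with library size.
library_size = [10.0, 12.0, 14.0, 16.0, 18.0]
gene = [5.0, 5.9, 7.1, 8.0, 9.0]
adjusted = remove_covariate(gene, library_size)
print([round(v, 3) for v in adjusted])
```

After adjustment the gene is uncorrelated with the covariate while its mean is unchanged, which is the property the sketch is meant to show.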

Detailed model development procedure
Four steps were applied to select genes and train the final SVM model, as described in Figure S4. In classifier development based on RNA-Seq data, which involve few samples and many features (over 60,000 genes), the conventional approach is to select differentially expressed genes between tumor and non-tumor samples using manually set thresholds of fold change > 2 and FDR < 0.05 [6]. We filtered out low-abundance and hypervariable genes on the basis of mapped reads and expression inequality. LASSO was used only to select the genes contributing to the discrimination between tumor and non-tumor [7], thereby reducing the high dimensionality. To facilitate further application of our TEPOC model, we sought to reduce the number of genes in the model. MRMR was used to rank the genes and to balance the number of genes against AUC performance [8]. Finally, the optimized number of genes was used to train the SVM model.
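The MRMR ranking step can be illustrated with a small greedy sketch in Python. This is not the authors' implementation: absolute Pearson correlation is used here as a stand-in for the relevance and redundancy measures (mutual information is a common alternative), the difference form of the score is an assumption, and the gene names and values are hypothetical.

```python
from math import sqrt

def abs_corr(x, y):
    """Absolute Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy / sqrt(sxx * syy))

def mrmr_rank(features, labels, k):
    """Greedily rank up to k genes by relevance minus mean redundancy."""
    relevance = {g: abs_corr(v, labels) for g, v in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        def score(g):
            if not selected:
                return relevance[g]
            redundancy = sum(abs_corr(features[g], features[s]) for s in selected)
            return relevance[g] - redundancy / len(selected)
        # sorted() makes tie-breaking deterministic
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): geneB nearly duplicates geneA; geneC is weaker but distinct.
labels = [0, 0, 0, 1, 1, 1]
toy = {
    "geneA": [1.0, 2.0, 1.0, 5.0, 6.0, 5.0],
    "geneB": [1.5, 2.0, 0.5, 5.5, 6.0, 4.5],
    "geneC": [2.0, 1.0, 3.0, 3.0, 2.0, 4.0],
}
print(mrmr_rank(toy, labels, 3))
```

In the toy data the near-duplicate `geneB` is ranked last despite its high relevance, because it adds little beyond `geneA`. In a full pipeline, the top-ranked genes would then be evaluated at increasing list lengths to trade off gene count against AUC before training the final classifier.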

Sample size estimation
The sample size calculation was based on the following assumptions. According to previously hospitalized patients in the Department of Obstetrics and Gynecology of Tongji Hospital, the ratio of ovarian cancer to non-cancer was about 0.8 (231:289) in the training cohort. The study was designed to demonstrate the superiority of tumor-educated platelets (AUC = 0.9) over CA125 (AUC = 0.8). Using a two-sided chi-square test, 80% power would be achieved at a two-sided significance level of α = 0.2. The minimum sample size was 66 (40 with ovarian cancer and 26 controls). We planned to include 74 patients in the validation cohort, assuming a dropout rate of 10%. All participants who met the inclusion criteria were to be enrolled consecutively until all cohorts reached the minimum sample size.
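The step from the minimum sample size of 66 to the planned enrollment of 74 is consistent with dividing by the expected retention rate and rounding up. The text does not state the formula, so the Python sketch below is an assumption that happens to reproduce the reported figure; the function name is hypothetical.

```python
from math import ceil

def inflate_for_dropout(minimum_n, dropout_rate):
    """Number to enroll so that minimum_n participants remain after dropout."""
    return ceil(minimum_n / (1 - dropout_rate))

# 66 / (1 - 0.10) = 73.33..., rounded up to 74
print(inflate_for_dropout(66, 0.10))
```

Note that the alternative convention of multiplying by (1 + dropout rate) would give 73 rather than 74.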

Validation method for quantitative real-time PCR (qPCR)
Total RNA was extracted using TRIzol reagent (Invitrogen, Thermo Fisher Scientific, Inc.) in accordance with the manufacturer's standard protocol. qPCR was performed in triplicate (n = 3) using the Bio-Rad CFX96 system with SYBR Green Supermix. The relative mRNA expression levels were calculated using the comparative Cq method (2^(−ΔΔCq)), with ACTB as the internal control.
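As a worked illustration of the comparative Cq calculation, the following Python sketch averages triplicate Cq values and computes 2^(−ΔΔCq) relative to a control sample. The function names and Cq values are hypothetical and are not data from the study.

```python
def mean(values):
    """Arithmetic mean of replicate Cq values."""
    return sum(values) / len(values)

def relative_expression(target_sample, ref_sample, target_control, ref_control):
    """Fold change by the comparative Cq method, 2^(-ddCq).

    Each argument is a list of replicate Cq values: the target gene and the
    reference gene (e.g. ACTB) in the sample of interest and in the control.
    """
    delta_sample = mean(target_sample) - mean(ref_sample)      # dCq, sample
    delta_control = mean(target_control) - mean(ref_control)   # dCq, control
    return 2 ** -(delta_sample - delta_control)

# Hypothetical triplicates: the target gene amplifies 2 cycles earlier in the
# sample than in the control after normalization to the reference gene.
fold = relative_expression(
    target_sample=[24.0, 24.2, 23.8],
    ref_sample=[18.0, 18.1, 17.9],
    target_control=[26.0, 26.1, 25.9],
    ref_control=[18.0, 18.0, 18.0],
)
print(round(fold, 2))
```

With these values ΔΔCq is −2, corresponding to roughly four-fold higher expression in the sample than in the control.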

Statistical analysis
The F1-score combines the precision and recall of a classifier into a single metric by taking their harmonic mean. It is primarily used to compare the performance of two classifiers. The formula for the F1-score is: F1 = 2 × (precision × recall) / (precision + recall). The permutation test is a popular technique for testing a hypothesis of no effect when the distribution of the test statistic is unknown. We permuted the patient labels 5000 times to generate a distribution of AUCs under random labeling, against which the P-value of our TEPOC AUC was tested [9].

Supplementary Table S1. Compositions of benign adnexal masses.